-
Notifications
You must be signed in to change notification settings - Fork 0
Refactor and optimize a bit robots parsing module #4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Signed-off-by: Yves Bastide <[email protected]>
Signed-off-by: Yves Bastide <[email protected]>
Signed-off-by: Yves Bastide <[email protected]>
Signed-off-by: Yves Bastide <[email protected]>
Signed-off-by: Yves Bastide <[email protected]>
Signed-off-by: Yves Bastide <[email protected]>
Signed-off-by: Yves Bastide <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR refactors and optimizes the robots.txt parsing module by consolidating logic, improving performance characteristics, and reducing external dependencies.
Changes:
- Merged
CheckAllowandCheckDisallowmethods into a singleCheckmethod to eliminate code duplication - Replaced
MatchAllowFastandMatchDisallowFastwith a unifiedMatchFastmethod that returns match length instead of boolean - Optimized pattern matching by introducing a fast path for non-wildcard patterns and moving index.htm normalization to rule creation time
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| RobotsTxt/Extensions.cs | Made MyExtensions class internal and added early-exit optimizations for case-insensitive comparisons |
| RobotsTxt/LongestMatchRobotsMatchStrategy.cs | Consolidated matching methods into a single MatchFast that returns match length; replaced boolean-based logic with length-based matching |
| RobotsTxt/RobotsMachine.cs | Merged CheckAllow/CheckDisallow into unified Check method; moved index.htm normalization to rule creation; inlined user agent comparison; changed to length-based pattern matching |
| RobotsTxt/RobotsTxt.csproj | Removed System.Memory package dependency |
| TestRobotsTxt/TestLongestMatchRobotsMatchStrategy.cs | Updated tests to verify match length instead of boolean result |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| [InlineData("/", "/", true, 1)] | ||
| [InlineData("/", "/$", true, 2)] | ||
| [InlineData("a", "b", false, -1)] | ||
| [InlineData("/foo/bar", "/bar", false, -1)] |
Copilot
AI
Jan 25, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test case verifies that substring matching doesn't occur when it shouldn't (e.g., '/bar' should not match '/foo/bar'). However, there's no corresponding test for the opposite scenario where a pattern should match when it appears at the start of the path. Consider adding a test case like [InlineData("/bar/foo", "/bar", true, 4)] to ensure the fast path correctly matches path prefixes.
| if (userAgent.Length != ua.Length) continue; | ||
| bool match = true; | ||
| for (int i = 0; i < ua.Length; i++) | ||
| { | ||
| byte a = userAgent[i]; | ||
| byte b = ua[i]; | ||
| if (a == b || (a >= 'A' && a <= 'Z' && a + 32 == b) || (b >= 'A' && b <= 'Z' && b + 32 == a)) | ||
| continue; | ||
| match = false; | ||
| break; | ||
| } |
Copilot
AI
Jan 25, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This manual case-insensitive comparison duplicates the logic already implemented in MyExtensions.EqualsIgnoreCase(). Consider using the existing extension method instead to reduce code duplication and improve maintainability.
RobotsTxt/RobotsMachine.cs
Outdated
| var haveWildcards = value.Length >= 1 && (value.Contains((byte)'*') || value[^1] == '$'); | ||
| var state = new AllowState(value.ToArray(), haveWildcards); | ||
|
|
||
| var fastPath = !value.ContainsAny("*$"u8); |
Copilot
AI
Jan 25, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The variable name 'fastPath' is ambiguous in this context. It's not immediately clear that this flag indicates whether the pattern lacks wildcards. Consider renaming to 'hasNoWildcards' or 'isSimplePattern' to better convey its meaning.
RobotsTxt/RobotsMachine.cs
Outdated
| var fastPath = !value.ContainsAny("*$"u8); | ||
| var state = new DisallowState(value.ToArray(), fastPath); |
Copilot
AI
Jan 25, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The variable name 'fastPath' is ambiguous in this context. It's not immediately clear that this flag indicates whether the pattern lacks wildcards. Consider renaming to 'hasNoWildcards' or 'isSimplePattern' to better convey its meaning.
| var fastPath = !value.ContainsAny("*$"u8); | |
| var state = new DisallowState(value.ToArray(), fastPath); | |
| var hasNoWildcards = !value.ContainsAny("*$"u8); | |
| var state = new DisallowState(value.ToArray(), hasNoWildcards); |
| private static bool Matches(ReadOnlySpan<byte> path, ReadOnlySpan<byte> pattern) | ||
| { | ||
| Span<int> pos = stackalloc int[path.Length + 1]; |
Copilot
AI
Jan 25, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Matches method is now private but appears to be incomplete in the diff. The method allocates a stack buffer but the implementation details are cut off. Ensure that this method body is complete and correctly implements the wildcard matching logic that was previously in MatchesFast.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| [InlineData("/", "/", true, 1)] | ||
| [InlineData("/", "/$", true, 2)] | ||
| [InlineData("a", "b", false, -1)] | ||
| [InlineData("/foo/bar", "/bar", false, -1)] |
Copilot
AI
Jan 25, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This new test case validates that patterns must match from the start of the path. Consider adding a corresponding test in the 'Slow' path to ensure both implementations handle this scenario consistently.
RobotsTxt/RobotsMachine.cs
Outdated
| var haveWildcards = value.Length >= 1 && (value.Contains((byte)'*') || value[^1] == '$'); | ||
| var state = new AllowState(value.ToArray(), haveWildcards); | ||
|
|
||
| var fastPath = !value.ContainsAny("*$"u8); |
Copilot
AI
Jan 25, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The variable name 'fastPath' is ambiguous. It should be renamed to 'isSimplePattern' to match the naming used in the AllowState constructor and throughout the codebase.
| if (isSimplePattern) | ||
| { | ||
| return path.IndexOf(pattern) != -1; | ||
| return path.StartsWith(pattern) ? pattern.Length : -1; |
Copilot
AI
Jan 25, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For simple patterns (no wildcards), the original logic checked if the pattern appeared anywhere in the path using IndexOf. The new logic only checks if the path starts with the pattern. This changes the behavior for non-wildcard patterns and may break existing functionality where patterns like '/bar' should match paths like '/foo/bar'.
| return path.StartsWith(pattern) ? pattern.Length : -1; | |
| return path.IndexOf(pattern) >= 0 ? pattern.Length : -1; |
Signed-off-by: Yves Bastide <[email protected]>
5029d22 to
813ddf4
Compare
Summary
This pull request refactors and optimizes several components related to robots parsing, aiming to improve maintainability and performance.
Changes
MyExtensionsclass internal for encapsulation.CheckAllowandCheckDisallowto streamline logic.System.Memorydependency to reduce unnecessary external dependencies.RobotsMachinefor better overall performance.Checklist